One Way to Create Define.xml Files
Implementation, Obstacles and
Enhancements
Authors Emöke Merli, Edith Heimsch, Dr. Elke
Sennewald
Company Kendle International Inc.
Stefan-George-Ring 6
81929 München
Contact Emöke Merli
Stefan-George-Ring 6
81929 München
Tel. +49 (0) 89 / 99 39 13 181
Fax +49 (0) 89 / 99 39 13 124
Edith Heimsch
Stefan-George-Ring 6
81929 München
Tel. +49 (0) 89 / 99 39 13 337
Fax +49 (0) 89 / 99 39 13 124
Dr. Elke Sennewald
Stefan-George-Ring 6
81929 München
Tel. +49 (0) 89 / 99 39 13 125
Fax +49 (0) 89 / 99 39 13 124
Abstract
In its Critical Path Initiative, the FDA underlines the
urgent need for a standardized approach to capture, receive and analyze
clinical study data. Providing the data definition document in a machine
readable format such as XML increases the level of
automation and improves the efficiency of the regulatory review process. Case
Report Tabulation Data Definition Specification (CRT DDS) is the CDISC standard
for providing metadata in XML format for an electronic submission to regulatory
authorities such as the FDA.
This paper describes how we have implemented CRT DDS
(commonly known as define.xml) Standards Version 1.0.0 at Kendle. Furthermore, it
demonstrates how Kendle uses the SAS based tool
Definedoc™ from Meta-Xceed, Inc. (MXI) [12] in the define.xml
generation process. Quality control processes developed by Kendle are also
discussed, and some interesting features of define.xml and issues encountered
with the CDISC guidelines are highlighted.
Keywords
CDISC
define.xml
Definedoc
metadata
Introduction
After the FDA’s announcement in its Study Data Specifications [11] document that define.xml is the preferred submission standard for SDTM metadata, define.xml became the most important format for SDTM metadata submission, thus replacing define.pdf.
“The specification for
the data definitions for datasets provided using the CDISC SDTM is included in
the Case Report Tabulation Data Definition Specification (define.xml) developed
by the CDISC define.xml Team. The latest release of the Case Report Tabulation
Data Definition Specification is available from the CDISC web site
(http://www.cdisc.org/models/def/v1.0/index.html). Include a reference to the
style sheet as defined in the specification and place the corresponding style
sheet in the same folder as the define.xml file.” [11]
But it is not only the FDA’s preference that makes define.xml
the format of choice. The main advantage of the XML format is that it is both machine
and human readable. Moreover, the XML format is platform independent, which facilitates
data transfer between many kinds of systems.
However, XML is new to the industry and hard to understand without prior XML experience. The first define.xml example [1], published by CDISC in 2007, provided a very good first insight into what define.xml is and demonstrated how the standard works.
In 2008, CDISC published the SDTM/ADaM Pilot Project [7], a collaborative project between CDISC, the FDA and the industry to assess how the define.xml standard can be implemented in a real study.
Besides the CDISC define.xml guideline [5], both the CDISC 2007 define.xml example [1] and the SDTM/ADaM Pilot Project [7] were very useful to Kendle while implementing a define.xml generation process.
Define.xml Generation
There are many ways to create define.xml files. Basically any format that provides some kind of interface between study data, including the corresponding metadata, and XML can be used to store data and metadata and to derive a define.xml file from them.
One option is to use an entirely XML based solution, where
both data and metadata are stored in XML format. If all mapping and analysis processes
use SAS®, however, this is apparently
not the most efficient option, since handling XML files in SAS®
is not very convenient.
Another possibility is
to use a SAS® based solution. Without any
additional tools though, one needs to be quite familiar with the basic XML
concept and structure to ensure correctness and completeness of the define.xml
output created by SAS®.
This paper outlines an enriched SAS® based approach developed at Kendle that uses SAS® to provide all necessary data and metadata and a commercially available SAS based software called Definedoc to convert this information into XML. While the focus is on the creation of define.xml files for SDTM data, the same approach can be used for ADaM data.
The starting point for this SAS based approach is the specification document stored as an Excel® file.
<Insert Figure 1 here, half page width>
Figure 1 shows a flow chart of the define.xml generation
process implemented at Kendle.
Step 1 – Specification Document
In a first step, the specification document containing all metadata is set up. Based on this specification document, the SDTM domains are created in SAS® and SDTM validation checks are performed to check for inconsistencies between the CDISC SDTM Guideline and the SDTM domains. The specification document and the SAS® domains are the basis for the define.xml file, which is created using Definedoc™ and supplemental SAS macros and programs. Further checks are applied to ensure consistency between data, metadata and define.xml. All necessary steps and checks are discussed in more detail below.
<Insert Figure 2 here, half page width>
Figure 2 illustrates the four steps of the define.xml
generation process with Definedoc™ and
the supplemental SAS® programs.
First step
In a first step, Definedoc™ is used to extract and store all metadata available in the SAS® domains. Figure 3 shows the main screen (data definition screen) of Definedoc™, which outlines all information on the project, including study, SAS® library names, input SAS® datasets, paths, output directories, output files as well as variable and dataset order.
<Insert Figure 3 here, half page width>
In a second screen (information screen), general study information, e.g. company name,
product name, protocol number, XPT file location and annotated CRF location, can
be entered. An example of the general information screen is shown in figure 4.
<Insert Figure 4 here, half page width>
By applying the Definedoc™ software, the following information is extracted from the SAS® domains:
· List of domains (sorted by class and domain names, as defined by CDISC [1])
· List of variables (order of appearance same as in domains)
· Variable names
· Variable labels
· Variable length
· Number of variables
· Number of records
A set of working files (SAS® datasets _define.sas7bdat and corresponding backup and audit trail files) is created, as shown in figure 5, to manage the information described above.
<Insert Figure 5 here, half page width>
These working files are used to create the structure of the define.xml
file. Furthermore, a stylesheet (define.xsl) is automatically generated by
Definedoc™.
An example of the output of this first step is shown in figure 6.
<Insert Figure 6 here, full page width>
Step 2 – Automate Data Attributes
To include all additional information that is needed for define.xml, SAS® programs and macros were developed at Kendle, which are applied in the second step of define.xml generation. The SAS® datasets generated by Definedoc™ (see figure 5) are enhanced with input from the specification document by merging the following information into the datasets:
· Domain description
· Domain structure
· Domain purpose
· Domain class
· Domain keys
· Variable controlled terms or formats
· Variable origin
· Variable role
· Comments
· Value level metadata
· Nested variables
· Code lists
· Repeating attribute
The structure of the specification document used is similar to the
structure of the SAS® datasets. Typically it
consists of three parts: dataset, variable and value level metadata.
Figure 7 shows an extract from the variable metadata of the specification document.
<Insert Figure 7 here, full page width>
The code list content is stored in the Controlled Terms or Format columns of the variable and value level sections. The column Codelist of the Excel® specification document (see figure 7) contains the name of the code list. Whenever this column is populated, Definedoc™ creates a code list for define.xml.
Information on variable nesting is stored in the column Nested Variable of the Excel® specification document (see figure 7). If a variable name (e.g. LBTESTCD) is inserted into this column, Definedoc™ captures the corresponding nested variable from the same row (e.g. LBCAT) and creates a nesting relation in the define.xml file. More details on the definition of nested variables and how they are implemented in define.xml can be found in the next chapter.
Furthermore, the ODM type of the variables is determined by
SAS® algorithms searching for
special character patterns in the data (SDTM domains).
Information from all sources (SAS® datasets created by Definedoc™, specification document and SDTM domains) is combined and consistency checks are run. As a result, the Definedoc™ datasets (see figure 5) are updated to include all additional information.
An example of an updated _define.sas7bdat dataset is shown in figure 8.
<Insert Figure 8 here, full page width>
Step 3 – Generate Updated Define.xml
Based on the datasets created in the previous step, Definedoc™ is run again to create a define.xml file which contains all information gathered so far (see figure 9).
<Insert Figure 9 here, full page width>
Step 4 – Update XML and XSL
In a final step, the define.xml and define.xsl files are slightly modified, using SAS® programs and macros, to finalize the define.xml layout. One of these modifications is the deletion of the columns ‘Number of Variables’ and ‘Number of Records’ (see figure 6), which are generated by Definedoc™ but not required by CDISC. Figure 10 shows the first page of the final define.xml file.
<Insert Figure 10 here, full page width>
Selected Define.xml Features
In this chapter some selected features of the final define.xml file are described in more detail.
SAS types vs. ODM types
It is commonly known that SAS distinguishes between two variable types only: character and numeric. As per the CDISC ODM 1.2 guideline [2], a variable can be assigned one of six different types: integer, float, date, datetime, time and text.
The easiest way to determine the type of a variable is to derive this information directly from the SAS datasets (SDTM domains) themselves. Using regular expressions, Kendle developed a SAS program to search for special character patterns in the data values:
1. The type of all character variables whose name does not end with -DTC is set to ‘text’.
2. All numeric variables are sorted into types ‘float’ or ‘integer’ depending on their number of decimal places.
3. The type of the remaining variables (name ending with -DTC) is set to ‘date’, ‘time’, or ‘datetime’, depending on the result of the algorithm displayed in figure 11.
<Insert Figure 11 here, full page width>
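The three rules above can be illustrated with a minimal sketch. It is written in Python rather than SAS and is not the Kendle program: the function name `odm_type`, the two ISO 8601 patterns and the fallback to ‘datetime’ for mixed values are assumptions made for illustration, and the algorithm in figure 11 may differ.

```python
import re

# Assumed ISO 8601 patterns (hypothetical; figure 11 may use different rules).
DATE_RE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")   # 2008, 2008-10, 2008-10-01
TIME_RE = re.compile(r"^T?\d{2}:\d{2}(:\d{2})?$")    # 08:30, T08:30:00


def odm_type(name, values):
    """Return an ODM 1.2 data type for one variable.

    name   -- variable name, e.g. "LBDTC"
    values -- non-missing values of the variable (str, int or float)
    """
    numeric = [v for v in values if isinstance(v, (int, float))]
    if numeric and len(numeric) == len(values):
        # Rule 2: numeric variables are 'integer' unless a value has decimals.
        has_decimals = any(not float(v).is_integer() for v in numeric)
        return "float" if has_decimals else "integer"
    if not name.upper().endswith("DTC"):
        # Rule 1: character variables not ending with -DTC are 'text'.
        return "text"
    # Rule 3: -DTC variables are classified by the patterns found in the data.
    texts = [str(v) for v in values]
    if texts and all(DATE_RE.match(t) for t in texts):
        return "date"
    if texts and all(TIME_RE.match(t) for t in texts):
        return "time"
    # Assumption: anything else (including mixed content) is 'datetime'.
    return "datetime"
```

Deriving the type from the data values rather than from the specification keeps define.xml consistent with what is actually submitted, at the cost of depending on the patterns chosen.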
Nested variables
Nested variables are variables that are linked to each other. In most cases nested variables are category (–CAT) and result variables (–TESTCD). Each category variable value corresponds to certain result variable values, and each result variable value corresponds to one or more category variable values. Figure 12 shows the relation between category variable LBCAT and the corresponding result variable LBTESTCD.
<Insert Figure 12 here, full page width>
In this example the variable LBCAT is linked to the value level metadata section of this variable, where all possible values of LBCAT are listed. One of these values is BIOCHEMISTRY. Clicking on the hyperlink navigates the user to a list of all LBTESTCD values for the category BIOCHEMISTRY.
While this functionality of define.xml was not implemented in the CDISC SDTM/ADaM Pilot Project [7], it is described in the CDISC 2007 define.xml example [1].
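The nesting relation between a category variable and a result variable can be pictured as a mapping from each category value to its result values. The following Python sketch derives such a mapping from domain records; the function name `nested_values` and the record layout (one dictionary per record) are assumptions for illustration, not part of Definedoc or the Kendle macros.

```python
from collections import defaultdict


def nested_values(records, parent="LBCAT", child="LBTESTCD"):
    """Map each category value to the sorted result values found with it."""
    mapping = defaultdict(set)
    for rec in records:
        mapping[rec[parent]].add(rec[child])
    return {cat: sorted(codes) for cat, codes in sorted(mapping.items())}


# Illustrative records only; the values are made up for this example.
lb = [
    {"LBCAT": "BIOCHEMISTRY", "LBTESTCD": "ALT"},
    {"LBCAT": "BIOCHEMISTRY", "LBTESTCD": "AST"},
    {"LBCAT": "HEMATOLOGY", "LBTESTCD": "HGB"},
    {"LBCAT": "BIOCHEMISTRY", "LBTESTCD": "ALT"},
]
print(nested_values(lb))
```

Each key of the resulting dictionary corresponds to one value level entry of LBCAT, and its list corresponds to the LBTESTCD values reached through the hyperlink.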
Navigation without back button
In cooperation with Meta-Xceed, hyperlinks were implemented to facilitate the navigation between the metadata levels (see figure 12). Clicking on the hyperlink LBCAT navigates the reviewers to the value level metadata section of LBCAT. Similarly, clicking on the header of the value level metadata (ValueList.LB.LBCAT) navigates them back to LBCAT in the variable level. This allows for navigation between levels without using the back button.
Automated code list generation
Information for variables –TESTCD, –TEST, –PARMCD, –PARM, QNAM and QLABEL is stored in the value level metadata section of define.xml. For example, the information for –TESTCD variables can be found in the ‘Value’ column and for –TEST variables in the ‘Label’ column. Kendle implemented an automated process in SAS to extract this information from the value level and create code lists for each –TESTCD and –TEST variable from it. The ‘Value’ column is presented in the ‘Code Value’ column and the ‘Label’ column is shown in the ‘Code Text’ column. See figure 13 for an example.
<Insert Figure 13 here, half page width>
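As a rough illustration of this mapping, the sketch below turns value level rows into an ODM CodeList element. It is written in Python, not SAS, and is not the Kendle process: the function name `build_codelist`, the OID and the row layout are hypothetical, and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_codelist(oid, name, value_rows):
    """Create a CodeList element from value level rows.

    Each row is a dict with the keys 'Value' (shown as 'Code Value')
    and 'Label' (shown as 'Code Text').
    """
    codelist = ET.Element("CodeList", OID=oid, Name=name, DataType="text")
    for row in value_rows:
        item = ET.SubElement(codelist, "CodeListItem", CodedValue=row["Value"])
        decode = ET.SubElement(item, "Decode")
        text = ET.SubElement(decode, "TranslatedText")
        text.text = row["Label"]
    return codelist


# Illustrative value level rows; OID and name are made up.
rows = [
    {"Value": "ALT", "Label": "Alanine Aminotransferase"},
    {"Value": "AST", "Label": "Aspartate Aminotransferase"},
]
codelist = build_codelist("CL.LBTESTCD", "LBTESTCD", rows)
print(ET.tostring(codelist, encoding="unicode"))
```

Generating the code list from the value level means that test codes and test names are maintained in one place only.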
‘Class’ column
Although the ‘Class’ column is not required by the Final SDTM 3.1.1 Guideline [1], Kendle includes the class column as described in the Draft SDTM 3.1.2 Guideline [10] in the define.xml browser representation (see figure 10). The ItemGroupDef attribute def:Class in define.xml is used to store the class information.
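How def:Class sits on an ItemGroupDef element can be sketched as follows. This Python example is illustrative only: the helper `item_group` is hypothetical, the two namespace URIs are given as the authors of this sketch understand define.xml 1.0 and ODM 1.2 and should be checked against the published schemas, and a real ItemGroupDef carries further attributes that are left out here.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for ODM 1.2 and define.xml 1.0 (verify against the schemas).
ODM_NS = "http://www.cdisc.org/ns/odm/v1.2"
DEF_NS = "http://www.cdisc.org/ns/def/v1.0"
ET.register_namespace("", ODM_NS)
ET.register_namespace("def", DEF_NS)


def item_group(domain, sdtm_class, repeating):
    """Build a reduced ItemGroupDef element that carries the class information."""
    return ET.Element(
        f"{{{ODM_NS}}}ItemGroupDef",
        {
            "OID": domain,
            "Name": domain,
            "Repeating": repeating,
            f"{{{DEF_NS}}}Class": sdtm_class,  # shown as the 'Class' column
        },
    )


elem = item_group("LB", "Findings", "Yes")
print(ET.tostring(elem, encoding="unicode"))
```

The style sheet can then read the attribute and render it as an additional column at dataset level.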
Adaption of SDTM define.xml for ADaM
The structure of the SDTM and ADaM metadata is very similar. Both models possess three levels of metadata: dataset, variable and value level metadata. Thus, for ADaM datasets, steps one to three of the define.xml generation process can be performed in the same way as described above (see also figure 2) for SDTM metadata.
The style sheet used for the browser representation of define.xml has to be adapted in step four:
· At dataset level: SDTM column ‘Class’ is changed to ‘Documentation’ for ADaM
· At variable level: SDTM column ‘Comment’ is changed to ‘Source’ for ADaM
· At value level: SDTM columns ‘Label’, ‘Value’ and ‘Comment’ are changed to ‘Param’, ‘Paramcd’ and ‘Source/Computational Method’ respectively
· Links to analysis results metadata (documents in PDF format) are created in the navigation bar
· Where applicable, links to additional supplemental documentation (documents in PDF format) are created in the navigation bar
When adapting SDTM define.xml for
Consistency Checks
During define.xml generation, it is important to validate not only the syntax but also the content and the semantics of the XML file. Therefore, several check mechanisms were implemented by Kendle to ensure correctness and CDISC compliance.
SDTM validation checks
First of all, SDTM data checks are performed to evaluate adherence to the SDTM guidelines. These checks are based upon the WebSDM V1.5 edit checks as published on the CDISC standards website [3] and upon the FDA Draft Specifications for SDTM Validation Criteria [4].
Additional SDTM validation checks developed by Kendle
To supplement the SDTM validation checks as published on the CDISC website, Kendle implemented the following additional SDTM validation checks:
· Identification of variables for which the label listed in the domain description is not consistent with the label implicit in the SAS dataset
· Identification of variables defined as key in the (study specific) description but for which uniqueness is not present in the SAS datasets
· Identification of variables for which the role listed in the domain description is not consistent with the role in the (study specific) description file
· Identification of columns with values equal to null (empty) for which (Standard) Core attribute is '
· Identification of domain tables for which the order of the variables in the SAS dataset is not consistent with the (study specific) description file
· Identification of variables for which the variable length listed in the (study specific) domain description is not consistent with the variable length implicit in the SAS dataset
· Identification of null (empty) values found in a column for which (study specific) Core attribute is 'Req' for new domains (X-)
· Identification of values that are not unique in (study specific) domain description per category and subcategory
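One of the checks listed above, the key uniqueness check, can be sketched in a few lines. The Python function below is an illustration under assumptions, not the Kendle SAS implementation: `duplicate_keys` is a hypothetical name, records are plain dictionaries, and the key variables are passed in by the caller.

```python
from collections import Counter


def duplicate_keys(records, keys):
    """Return the key value combinations that occur in more than one record."""
    counts = Counter(tuple(rec.get(k) for k in keys) for rec in records)
    return [combo for combo, n in counts.items() if n > 1]


# Illustrative records; the second and third share the same key values.
vs = [
    {"USUBJID": "001", "VSTESTCD": "SYSBP", "VISITNUM": 1},
    {"USUBJID": "001", "VSTESTCD": "SYSBP", "VISITNUM": 2},
    {"USUBJID": "001", "VSTESTCD": "SYSBP", "VISITNUM": 2},
]
print(duplicate_keys(vs, ["USUBJID", "VSTESTCD", "VISITNUM"]))
```

An empty result means that the variables defined as keys in the specification really identify each record of the dataset.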
Define.xml syntax checks
In addition to the SDTM validation checks, the define.xml syntax is validated against the CDISC schemas published on the CDISC define.xml [5] and ODM 1.2.1 [6] standards websites to ensure schema compliance. The schemas are connected to each other as shown in figure 14.
<Insert Figure 14 here, half page width>
Define.xml content checks
Besides the syntax checks, maintaining the accuracy of the define.xml file content is of major importance. An extract of the checklist applied by Kendle can be found below:
· Identification of variables for which the type in domains and define.xml does not match CDISC ODM 1.2 types
· Identification of variables or values for which the column ‘Controlled Terms or Format’ of define.xml does not contain code list links
· Identification of values for which nested variables are not appropriate
· Identification of values that occur in more than one category (nested variables) and check of correctness
· Identification of value lists with more than one code list name
· Identification of code list names with more than one value list
· Identification of variables which occur in more than one domain and where the variable labels do not match
· Identification of variables which occur in more than one domain and where the variable lengths do not match
· Identification of variables for which ‘Core’ column does not equal REQ,
· Identification of duplicate values for domain, variable, value, category and subcategory
· Identification of domains that occur on variable metadata level but are not listed on dataset level
· Identification of variables that occur on value level but are not listed on variable level
· Identification of variables that are defined as nested variables but do not exist on variable metadata level
· Identification of implausible nested variables where the strings ‘CAT’ and ‘TESTCD’ cannot be found in the variable name
· Identification of variables for which the content of the Excel® specification document does not match the content of the Definedoc generated SAS datasets
· Identification of variables (–DUR, –DTC, –ELTM, –EVLINT) for which ISO 8601 is not correctly set in ‘Controlled Terms or Format’ column
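To give an idea of how such a content check can work, the sketch below implements the cross-domain label check from the list above. It is a Python illustration under assumptions: the function name `label_mismatches` and the row layout (one dictionary per domain and variable) are hypothetical and do not reflect the structure of the Definedoc datasets.

```python
from collections import defaultdict


def label_mismatches(variable_metadata):
    """Return variables that carry different labels in different domains."""
    labels = defaultdict(set)
    for row in variable_metadata:
        labels[row["variable"]].add(row["label"])
    return {var: sorted(found) for var, found in labels.items() if len(found) > 1}


# Illustrative variable level metadata with one inconsistent label.
meta = [
    {"domain": "DM", "variable": "USUBJID", "label": "Unique Subject Identifier"},
    {"domain": "LB", "variable": "USUBJID", "label": "Unique Subject ID"},
    {"domain": "LB", "variable": "LBTESTCD", "label": "Lab Test or Examination Short Name"},
]
print(label_mismatches(meta))
```

The same pattern, grouping by variable and comparing the collected attribute values, applies to the variable length check.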
Obstacles
While developing the Kendle approach for the define.xml file creation, we encountered a few obstacles which may be of interest to everyone creating their own standardization approach.
ODM version 1.2 vs. 1.3
When define.xml version 1.0.0 was released in 2005, ODM version 1.2 was the most recent version. Thus define.xml, being an extension to ODM, is based on ODM 1.2. In the meantime, ODM was developed further and the updated ODM version 1.3 was released in 2006. However, this enhancement has not yet been considered in the define.xml schemas. These schemas still reference ODM 1.2 (see figure 15).
<Insert Figure 15 here, full page width>
As long as define.xml version 1.0.0 has not been replaced by the next version, ODM 1.2 should be used for electronic submission to the FDA. Otherwise, the FDA will not be able to read the define.xml file [9].
‘Controlled Terms or Format’ column
There are discrepancies between the CDISC SDTM guideline [1] and the CDISC define.xml guideline [5] with respect to the use of the ‘Controlled Terms or Format’ column. On the one hand, the CDISC SDTM guideline states that the content of this column could be any kind of text. On the other hand, the define.xml guideline only allows for the use of code lists in this column. Thus, the user is forced to include all information in the form of code lists [9].
‘Repeating’ attribute
The CDISC ODM and define.xml guidelines do not clearly define the use of the ‘Repeating’ attribute, especially for trial design domains such as TA, TE, TI, TS and TV. From the authors’ point of view, the most logical approach is to define this attribute as follows [9]:
· Repeating=“Yes” for domains with more than one record per unique subject identifier
· Repeating=“No” for domains with one record per unique subject identifier
· Repeating=“No” for all trial design domains
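The three rules proposed above translate directly into a small decision function. The Python sketch below is illustrative only; the function name `repeating` and its input (the list of unique subject identifiers, one per record of the domain) are assumptions.

```python
# Trial design domains named in the text above.
TRIAL_DESIGN_DOMAINS = {"TA", "TE", "TI", "TS", "TV"}


def repeating(domain, usubjids):
    """Return the proposed value of the 'Repeating' attribute for a domain.

    usubjids -- the USUBJID value of every record in the domain
    """
    if domain.upper() in TRIAL_DESIGN_DOMAINS:
        return "No"
    # More records than distinct subjects means some subject repeats.
    return "Yes" if len(usubjids) > len(set(usubjids)) else "No"
```

For example, a demographics domain with one record per subject yields “No”, while a laboratory domain with several records per subject yields “Yes”.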
Conclusion
When generating define.xml for SDTM datasets, not only should you refer to the define.xml guideline, but you should also review the essential SDTM and ODM guidelines. The SDTM guideline describes the structure of the data (SDTM domains) and thus is the basis for the documentation of metadata. Besides many other features, ODM stores the metadata pertaining to the SDTM domains, and define.xml is an extension to ODM which enriches it with many elements and attributes.
As the ODM and SDTM guidelines have significantly evolved since the last define.xml version release in 2005, an update of the define.xml guideline is more than overdue.
The FDA Study Data Specifications [11] allow for metadata to be submitted as a define.xml file and for SDTM data as SAS XPT files. The next logical step would be to also allow for SDTM data to be submitted in XML format. This way, both data and metadata could be submitted in a single format, thus simplifying the whole submission process for industry and regulatory agencies alike.
It is also worth mentioning that the demand for a define.xml ADaM guideline has been growing steadily since the publication of the CDISC SDTM/ADaM Pilot Project [7].
With the introduction of the HL7-XML standard, however, further developments of define.xml remain to be seen.
Acknowledgements
The authors would like to thank Sy Truong from
References
1. CDISC Metadata Submission Guidelines, Appendix to the Study Data Tabulation Model (SDTM) Implementation Guide 3.1.1, Draft Version 0.9, 25.Jul.2007 (http://www.cdisc.org/models/sdtm/v1.1/index.html)
2. CDISC Specification for the Operational Data Model (ODM), Version 1.2, January 2004 (http://www.cdisc.org/models/odm/v1.2/ODM1-2-0.html)
3. Validation checks performed by WebSDM™ on SDTM version 3.1.1 datasets, Version 1.5, 12.Apr.2007 (http://www.phaseforward.com/products/safety/documents/ValidationChecksPerformedbyWebSDMtm.Q107.pdf)
4. FDA Draft Specifications for SDTM Validation Criteria (v3.1, v3.1.1), Version 0.1, 01.Sep.2008 (http://www.fda.gov/oc/datacouncil/janus_sdtm_validation_specification_v1.pdf)
5. CDISC Case Report Tabulation Data Definition Specification (CRT-DDS, also called define.xml), Version 1.0, 10.Feb.2005 (http://www.cdisc.org/models/def/v1.0/index.html)
6. CDISC Operational Data Model, Final Version 1.2.1, January 2005 (http://www.cdisc.org/models/odm/v1.2.1/index.html)
7. CDISC SDTM/ADaM Pilot Project (http://www.cdisc.org/membersonly/members_sdtm.html)
8. CDISC Operational Data Model, Final Version 1.3, latest change date 19.Dec.2006 (http://www.cdisc.org/models/odm/v1.3/final/ODM1-3-0-Final.htm)
9. CDISC Public Discussion Forum (http://www.cdisc.org/discussions/index.html)
- Case Report Tabulation (CRT-DDS or define.xml)
Thread: ODM 1.2 or ODM 1.3, October 2008
Thread: Controlled Terms or Format Column, October 2008
- ODM V1.3 Final
Thread: Repeating Attribute, October 2008
10. CDISC Study Data Tabulation Model (SDTM) Implementation Guide, Draft Version 3.1.2, 10.Jul.2007 (http://www.cdisc.org/models/sdtm/v1.2/index.html)
11. FDA Study Data Specifications, Version 1.4, 01.Aug.2007 (http://www.fda.gov/CDER/regulatory/ersr/Studydata.pdf)
12.
Figures
Figure 1: Flow chart for define.xml generation process
Figure 2: Definedoc, supplemental SAS programs and define.xml checks
Figure 3: Definedoc – Data Definition Screen
Figure 4: Definedoc – General Information Screen
Figure 5: List of working files
Figure 6: First step of define.xml generation, Definedoc output, structure of the define.xml file
Figure 7: Variable metadata of the specification document
Figure 8: Second step of define.xml generation, result of SAS program run; all metadata included in a single SAS dataset
Figure 9: Third step of define.xml generation, Definedoc output with all metadata included
Figure 10: Fourth step of define.xml generation, final define.xml
Figure 11: SAS algorithm for type determination
Figure 12: Nested variables and hyperlinks
Figure 13: Automated code list generation
Figure 14: Relation between CDISC define.xml and ODM schemas
Figure 15: Reference from define.xml schemas to ODM 1.2.1 schemas